(54) LOOP OPTIMIZATION SYSTEM

(57) Abstract:

PURPOSE: To make the optimization easier to apply by having a
compiler automatically judge whether to apply optimization by
issuing prefetch load instructions, which is effective for
dynamically reducing the cache miss penalty, and perform the
optimization.

CONSTITUTION: In the analysis pass (105) of software
pipelining optimization, the information required to apply
prefetch loads is gathered: the number and data type of the
prefetch targets and the number of registers needed. After a
loop to be pipelined is selected (106), the prefetch load
implementation selection (107) determines an expansion
method that does not impede global register allocation (109).
The number of expansions is divided by the number of targets
so as to equalize the distance between successive prefetch
loads, thereby dividing the loop body into as many parts as
there are targets. Then an intermediate code specifying a
prefetch load is inserted immediately before each divided part
when the converted intermediate code is output (108) after
pipelining conversion.
SPECIFICATION <EXCERPT>

[0015] 

[Embodiment] Hereinafter, the embodiment of the present invention 
shall be described with reference to the Drawings. FIG. 1 shows an 
embodiment of the present invention. As shown in FIG. 1, a compiler 
advances through lexical analysis and syntax analysis of a source code 
(101), semantic analysis and intermediate code generation (102), and 
global optimization (103). The processes up to this point and 
processes from the global register allocation (109) onward are the 
same as in a conventional method. Furthermore, in this example, 
software pipelining optimization (104) is performed after the global 
optimization. 

[0016] The present invention adds improvements to the process of
software pipelining optimization (104). The case of inserting prefetch
loads is described below using the code fragment in FIG. 2. This
prefetch load optimization is performed only when the user specifies it
through an option, and it is not applied when the number of loop
iterations to be executed is small. It is to be noted that, in this
example, it is assumed that the float type is 4 bytes, the cache line
length is 32 bytes, and there are 32 general-purpose registers and 32
floating-point registers. The loop iteration count N is assumed to be
unknown at the time of compiling.
[0017] In the software pipelining optimization process (104), first,
intermediate language analysis (105) is performed, and whether or not
to perform optimization is determined in the pipelining implementation
selection (106), based on the control variable and the loop structure.
By the stage at which the intermediate code is scanned in 105, the code
fragment in FIG. 2 has become intermediate code such as that shown in
FIG. 3. When an array subscripted by the control variable, or a pointer
variable that is updated in the loop and used to refer to memory
contents, is found during the scan, it is counted as a prefetch load
target. In the example in FIG. 2, arrays a and c qualify, and thus there
are two targets. In the same pass, the required number of registers is
calculated from the number of variables and constants. In the example
in FIG. 2, the floating-point registers are the more tightly
constrained: registers are needed for the targets of the two loads and
of the multiplication (301, 302 in FIG. 3) and for the variable b. Since
the variable b can be regarded as a constant within the loop, the same
register can be allocated to it regardless of the number of expansions,
and thus the variable b is counted as a constant.
[0018] Based on the information collected in 105, whether or not to
perform pipelining optimization is determined in the pipelining
implementation selection (106) process. When the loop is selected for
optimization, the insertion method for the prefetch loads is determined
in 107 from the information collected up to this point. The operational
flow of 107 is shown in FIG. 4. The detailed processes are performed in
the order [1] to [3] below. After 107 ends, the converted intermediate
code, with intermediate codes specifying the prefetch loads inserted
into it, is output in the converted intermediate code outputting (108)
process.

[0019] [1] The expansion method is determined by comparing the
optimal number of expansions with the limit on the number of
expansions imposed by the register resources. First, the optimal
number of expansions is calculated in 401 by dividing the cache line
length by the width of the prefetch load target region referred to per
iteration. In the example in FIG. 2, the line length is 32 bytes and the
4-byte array elements are accessed one at a time, and thus the result
is 32/4 = 8. Next, the expansion limit is calculated in 402 from the
required number of registers so that there is no shortage of registers
in the subsequent global register allocation (109). When the number of
variables is v, the number of constants is c, and the number of
registers that can be allocated is m, the maximum number of expansions
is the largest natural number n which satisfies the inequality below.
n × v + c < m

In the example in FIG. 2, since v=3, c=1, and m=32, the maximum
number of expansions n is 10. Since the optimal number of expansions
(8) does not exceed this limit in the comparison in 403, the loop is
expanded by the optimal number of expansions (404). Even when the
optimal number exceeds the limit, the reason a prefetch load requires
loop expansion differs from the conventional case. Therefore, as shown
in FIG. 5, the code for the feasible number of expansions is output
repeatedly so that the optimal number of expansions is executed in a
single iteration (405). FIG. 5 is an example in which the optimal
number of expansions, 8, is implemented by repeating the code for four
expansions twice (501, 502), which allows 501 and 502 to use the same
register set.

[0020] [2] The insertion positions are determined from the number of
prefetch loads. First, the number of prefetch loads is checked (406),
and when there is only one, the prefetch load is placed at the
beginning of the loop (407). When there are plural prefetch loads, the
number of expansions is divided by the number of prefetch loads (408),
and the prefetch loads are positioned at intervals that are as equal as
possible. Since there are two prefetch loads in the example in FIG. 2,
the code is divided into parts of four expansions each, and prefetch
loads (601, 603) corresponding to arrays a and c are respectively
inserted at the beginning of each part (602, 603), as shown in FIG. 6.
With this, the inserted prefetch loads are at roughly equal intervals,
and by cutting the instruction scheduling units at those points, the
intervals are maintained without the instruction scheduler (FIG. 1, 110)
having to take them into account afterward.

[0021] However, this method is one example of implementation, and a
design is also possible in which the instruction scheduler (110)
re-positions the prefetch loads within the loop at equal intervals.
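
The placement rule of step [2] can be sketched as follows. The function
name and the output array are hypothetical, and integer division is
assumed as the way of making the intervals "as equal as possible" when
the number of expansions is not a multiple of the number of prefetch
loads.

```c
#include <assert.h>

/* For each prefetch load k, store in pos[k] the index of the expanded
   copy (0-based) in front of which that prefetch load is inserted.
   pos must have room for `loads` entries. */
void prefetch_positions(int expansions, int loads, int *pos)
{
    assert(expansions > 0 && loads > 0 && pos != 0);
    for (int k = 0; k < loads; k++)
        pos[k] = (k * expansions) / loads;  /* 407 when loads == 1, else 408 */
}
```

For the example in FIG. 2 (eight expansions, two prefetch loads) this
yields positions 0 and 4, that is, one prefetch load at the head of
each part of four expansions.
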
[0022] [3] Next, the width of the prefetch, that is, how far ahead of
the current access the prefetch load reaches, is determined in 409.
Although this is highly dependent on the hardware mechanism, normally
it is sufficient to leave just enough leeway for cache miss processing
to finish by the time the prefetched location is accessed. However,
when there is hardware support which invalidates the instruction in the
case where a prefetch load is issued while another cache miss is being
processed, a greater width of about one line length is taken.
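
Putting steps [1] to [3] together, the loop in FIG. 2 might look as
follows after the transformation when written by hand in C. This is a
sketch under several assumptions: the function name is hypothetical,
the GCC/Clang builtin __builtin_prefetch stands in for the prefetch
load instruction, the prefetch width is taken to be one cache line
(eight floats), and a remainder loop is added because N is unknown at
compile time.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: a compiler builtin stands in for the prefetch load;
   on other compilers the macro does nothing. */
#if defined(__GNUC__)
#define PREFETCH(p) __builtin_prefetch(p)
#else
#define PREFETCH(p) ((void)(p))
#endif

enum {
    UNROLL = 8, /* 32-byte line / 4-byte float (step [1]) */
    AHEAD  = 8  /* assumed prefetch width: one line of floats (step [3]) */
};

void scaled_add(float *a, const float *c, float b, size_t n)
{
    assert(n == 0 || (a != NULL && c != NULL));
    size_t i = 0;
    for (; i + UNROLL <= n; i += UNROLL) {
        PREFETCH(&a[i + AHEAD]);      /* head of the first part (step [2]) */
        a[i + 0] += b * c[i + 0];
        a[i + 1] += b * c[i + 1];
        a[i + 2] += b * c[i + 2];
        a[i + 3] += b * c[i + 3];
        PREFETCH(&c[i + AHEAD]);      /* head of the second part */
        a[i + 4] += b * c[i + 4];
        a[i + 5] += b * c[i + 5];
        a[i + 6] += b * c[i + 6];
        a[i + 7] += b * c[i + 7];
    }
    for (; i < n; i++)                /* leftover iterations */
        a[i] += b * c[i];
}
```

Because AHEAD equals UNROLL, the prefetched address never goes beyond
one element past the end of either array, and the prefetch itself does
not read or modify program-visible data.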

[0023] It is to be noted that, in this example, the instruction-level
parallelism supported by the target computer is assumed to be
comparatively low. As such, since the number of expansions required
for software pipelining is smaller than the values compared in 403, it
is not taken into consideration here. Furthermore, for the same
reason, in the global register allocation (109), the required number of
registers is estimated under the assumption that the smallest number
of registers is to be allocated. For a computer having higher
instruction-level parallelism, it is necessary to take into
consideration that these balances change significantly.
[0024] 

[Advantageous Effects of Invention] According to the present invention,
since optimization through the issuance of prefetch loads can be
performed automatically by a compiler, it is possible to automatically
reduce the processor stalls that occur due to cache misses, and to
significantly improve the execution performance of the object code.
[Brief Description of Drawings] 

FIG. 1 is a structure diagram of a compiler utilized in the present 
invention. 

FIG. 2 is a diagram showing a source program for describing a specific 
example of the present invention. 

FIG. 3 is a diagram showing a model of intermediate code for the source 
program in FIG. 2. 

FIG. 4 is a structure diagram of a part which determines prefetch load 
implementation. 

FIG. 5 is a diagram showing an expansion method for the case where 
the number of loop expansions required for optimal load prefetch 
outputting cannot be secured due to a restriction on the register 
resources. 

FIG. 6 is a diagram showing a model of converted intermediate code for 
the source program in FIG. 2. 
101 Lexical and syntax analysis

102 Semantic analysis and intermediate code generation 

103 Global optimization 

104 Software pipelining optimization 

105 Intermediate language analysis and break down 

106 Pipelining implementation selection 

107 Prefetch load implementation selection 

108 Converted intermediate code outputting 

109 Global register allocation 

110 Instruction scheduling

111 Target code generation 
401 Calculate optimal number of expansions o

402 Calculate upper limit on number of expansions n 

403 o > n? 

404 Expand according to optimal number of expansions 

405 Consecutively output expansion codes 

406 Number of prefetch loads = 1? 

407 Position at beginning of loop 

408 Position every (number of expansions / number of loads)
expansions

409 Determine width of prefetch 
for (i=0;i<N;i++){
  a[i]+=b*c[i];
}

load a[i],R1
load c[i],R2
mult b,R2,R3
add R1,R3,R4
store R4,a[i]

load a[i+3],R1
load c[i+3],R2
mult b,R3,R4
add R5,R6,R7
store R8,a[i]
for (i = 0; i < n; i++){
  ld R1,A1;
  ...
  ld Rm,Am;
  f(i,R1,...,Rm)
}

sld R01,A1;
...
sld R0m,Am;
for (i = 0; i < n; i++){
  sld R11,A1;
  ...
  sld R1m,Am;
  f(i,R01,...,R0m)
  if ( ++i >= n ) break;
  sld R01,A1;
  ...
  sld R0m,Am;
  f(i,R11,...,R1m)